30 research outputs found

Parallel identification of the spelling variants in corpora

Author: Reynaert M.W.C.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2009

Tilburg University Repository

Corpus-Induced Corpus Clean-up

Author: Reynaert M.W.C.
Publication venue: ELRA
Publication date: 01/01/2006

Tilburg University Repository

Text-Induced Spelling Correction

Author: Reynaert M.W.C.
Publication venue: Self-published
Publication date: 01/01/2005

Tilburg University Repository

Character confusion versus focus word-based correction of spelling and OCR variants in corpora

Author: Reynaert M.W.C.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2010

Springer - Publisher Connector

Tilburg University Repository

On OCR ground truths and OCR post-correction gold standards, tools and formats

Author: Reynaert M.W.C.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2014

Tilburg University Repository

Text Induced Spelling Correction

Author: Reynaert M.W.C.
Publication venue: Unknown Publisher
Publication date: 01/01/2004

We present TISC, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from a very large corpus of raw text, without supervision, and contains word unigrams and word bigrams. It is stored in a novel representation based on a purpose-built hashing function, which provides a fast and computationally tractable way of checking whether a particular word form likely constitutes a spelling error and of retrieving correction candidates. The system employs input context and lexicon evidence to automatically propose a limited number of ranked correction candidates when insufficient information for an unambiguous decision on a single correction is available. We describe the implemented prototype and evaluate it on English and Dutch text containing real-world errors in more or less limited contexts. The results are compared with those of the isolated-word spelling checking programs Ispell and the Microsoft Proofing Tools (MPT).

Crossref

Tilburg University Repository

Synergy of Nederlab and @PhilosTEI: diachronic and multilingual Text-Induced Corpus Clean-up

Author: Reynaert M.W.C.
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/05/2014

Tilburg University Repository

All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation

Author: Reynaert M.W.C.
Publication venue: ELRA
Publication date: 01/01/2008

Tilburg University Repository

Non-interactive OCR post-correction for giga-scale digitization

Authors: Gelbukh A., Reynaert M.W.C.
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2008

CLAM: Quickly deploy NLP command-line tools on the web

Authors: Reynaert M.W.C., van Gompel M.
Publication venue: Dublin City University and Association for Computational Linguistics
Publication date: 01/08/2014

In this paper we present the software CLAM, the Computational Linguistics Application Mediator. CLAM is a tool that allows you to quickly and transparently transform command-line NLP tools into fully-fledged RESTful web services with which automated clients can communicate, and that also provides a generic web application interface for human end-users.

Tilburg University Repository